---
title: "Overview"
description: "Understanding how to evaluate and measure AI agent quality using Kastrax evals."
---

# Testing your agents with evals ✅

While traditional software tests have clear pass/fail conditions, AI outputs are non-deterministic — they can vary with the same input. Evals help bridge this gap by providing quantifiable metrics for measuring agent quality.

Evals are automated tests that evaluate Agents outputs using model-graded, rule-based, and statistical methods. Each eval returns a normalized score between 0-1 that can be logged and compared. Evals can be customized with your own prompts and scoring functions.

Evals can be run in the cloud, capturing real-time results. But evals can also be part of your CI/CD pipeline, allowing you to test and monitor your agents over time.

## Types of Evals ✅

There are different kinds of evals, each serving a specific purpose. Here are some common types:

1. **Textual Evals**: Evaluate accuracy, reliability, and context understanding of agent responses
2. **Classification Evals**: Measure accuracy in categorizing data based on predefined categories
3. **Tool Usage Evals**: Assess how effectively an agent uses external tools or APIs
4. **RAG Evals**: Evaluate retrieval accuracy and relevance in RAG systems
5. **Conversation Evals**: Measure the quality of multi-turn conversations

## Getting Started with Evals ✅

To start using evals in your Kastrax project, you'll need to add the evals module to your dependencies:

```kotlin
// build.gradle.kts
dependencies {
    implementation("ai.kastrax:kastrax-core:0.1.0")
    implementation("ai.kastrax:kastrax-evals:0.1.0")
    // Other dependencies...
}
```

Here's a simple example of how to create and run an eval:

```kotlin
import ai.kastrax.core.agent.agent
import ai.kastrax.evals.textual.FactualAccuracyEval
import ai.kastrax.integrations.deepseek.deepSeek
import ai.kastrax.integrations.deepseek.DeepSeekModel
import kotlinx.coroutines.runBlocking

fun main() = runBlocking {
    // Create an agent to evaluate
    val myAgent = agent {
        name("TestAgent")
        description("A test agent for evaluation")
        
        model = deepSeek {
            apiKey("your-deepseek-api-key")
            model(DeepSeekModel.DEEPSEEK_CHAT)
            temperature(0.7)
        }
    }
    
    // Create an eval
    val factualAccuracyEval = FactualAccuracyEval()
    
    // Define test cases
    val testCases = listOf(
        "What is the capital of France?",
        "Who wrote 'Pride and Prejudice'?",
        "What is the chemical formula for water?"
    )
    
    // Run the eval
    val results = factualAccuracyEval.evaluate(myAgent, testCases)
    
    // Print results
    println("Factual Accuracy Score: ${results.score}")
    println("Individual Scores:")
    results.individualScores.forEachIndexed { index, score ->
        println("  ${testCases[index]}: $score")
    }
}
```

## Built-in Evals ✅

Kastrax provides several built-in evals:

### Textual Evals ✅

```kotlin
// Factual Accuracy Eval
val factualAccuracyEval = FactualAccuracyEval()

// Relevance Eval
val relevanceEval = RelevanceEval()

// Coherence Eval
val coherenceEval = CoherenceEval()

// Toxicity Eval
val toxicityEval = ToxicityEval()

// Bias Eval
val biasEval = BiasEval()
```

### Classification Evals ✅

```kotlin
// Classification Accuracy Eval
val classificationAccuracyEval = ClassificationAccuracyEval(
    categories = listOf("Positive", "Negative", "Neutral")
)

// Multi-label Classification Eval
val multiLabelEval = MultiLabelClassificationEval(
    labels = listOf("Technology", "Science", "Politics", "Sports")
)
```

### Tool Usage Evals ✅

```kotlin
// Tool Selection Eval
val toolSelectionEval = ToolSelectionEval(
    availableTools = listOf("calculator", "weather", "search")
)

// Tool Parameter Eval
val toolParameterEval = ToolParameterEval()

// Tool Execution Eval
val toolExecutionEval = ToolExecutionEval()
```

### RAG Evals ✅

```kotlin
// Retrieval Precision Eval
val retrievalPrecisionEval = RetrievalPrecisionEval()

// Retrieval Recall Eval
val retrievalRecallEval = RetrievalRecallEval()

// Answer Relevance Eval
val answerRelevanceEval = AnswerRelevanceEval()

// Citation Accuracy Eval
val citationAccuracyEval = CitationAccuracyEval()
```

## Creating Custom Evals ✅

You can create custom evals by implementing the `Eval` interface:

```kotlin
import ai.kastrax.core.agent.Agent
import ai.kastrax.evals.Eval
import ai.kastrax.evals.EvalResult

class CustomEval : Eval<String> {
    override val name: String = "CustomEval"
    override val description: String = "A custom evaluation metric"
    
    override suspend fun evaluate(agent: Agent, testCases: List<String>): EvalResult {
        val individualScores = mutableListOf<Double>()
        
        for (testCase in testCases) {
            // Generate a response from the agent
            val response = agent.generate(testCase)
            
            // Implement your custom scoring logic
            val score = calculateScore(testCase, response.text)
            
            individualScores.add(score)
        }
        
        // Calculate the overall score (average of individual scores)
        val overallScore = individualScores.average()
        
        return EvalResult(
            score = overallScore,
            individualScores = individualScores,
            metadata = mapOf("evalName" to name)
        )
    }
    
    private fun calculateScore(testCase: String, response: String): Double {
        // Implement your custom scoring logic
        // Return a score between 0 and 1
        
        // Example: Simple length-based scoring (for demonstration only)
        return minOf(response.length / 100.0, 1.0)
    }
}
```

## Model-Graded Evals ✅

Model-graded evals use an LLM to evaluate agent responses:

```kotlin
import ai.kastrax.core.agent.Agent
import ai.kastrax.evals.ModelGradedEval
import ai.kastrax.integrations.deepseek.deepSeek
import ai.kastrax.integrations.deepseek.DeepSeekModel

class HelpfulnessEval : ModelGradedEval<String> {
    override val name: String = "HelpfulnessEval"
    override val description: String = "Evaluates how helpful the agent's responses are"
    
    override val evaluationModel = deepSeek {
        apiKey("your-deepseek-api-key")
        model(DeepSeekModel.DEEPSEEK_CHAT)
        temperature(0.1) // Low temperature for consistent evaluation
    }
    
    override val evaluationPrompt = """
        You are evaluating the helpfulness of an AI assistant's response to a user query.
        
        User Query: {{query}}
        
        Assistant Response: {{response}}
        
        Rate the helpfulness of the response on a scale from 0 to 10, where:
        - 0: Not helpful at all, completely irrelevant or incorrect
        - 5: Somewhat helpful, but missing important information or not fully addressing the query
        - 10: Extremely helpful, fully addresses the query with accurate and comprehensive information
        
        Provide your rating as a number between 0 and 10, followed by a brief explanation.
    """.trimIndent()
    
    override fun parseScore(evaluationResponse: String): Double {
        // Extract the numerical score from the evaluation response
        val scoreRegex = """(\d+)""".toRegex()
        val matchResult = scoreRegex.find(evaluationResponse)
        
        return if (matchResult != null) {
            val score = matchResult.groupValues[1].toInt()
            score / 10.0 // Normalize to 0-1 range
        } else {
            0.5 // Default score if parsing fails
        }
    }
}
```

## Running Evals in CI/CD ✅

You can integrate evals into your CI/CD pipeline:

```kotlin
import ai.kastrax.core.agent.agent
import ai.kastrax.evals.EvalSuite
import ai.kastrax.evals.textual.FactualAccuracyEval
import ai.kastrax.evals.textual.RelevanceEval
import ai.kastrax.evals.textual.CoherenceEval
import ai.kastrax.integrations.deepseek.deepSeek
import ai.kastrax.integrations.deepseek.DeepSeekModel
import kotlinx.coroutines.runBlocking
import java.io.File

fun main() = runBlocking {
    // Load the agent
    val myAgent = agent {
        name("ProductionAgent")
        description("A production agent for evaluation")
        
        model = deepSeek {
            apiKey(System.getenv("DEEPSEEK_API_KEY"))
            model(DeepSeekModel.DEEPSEEK_CHAT)
            temperature(0.7)
        }
    }
    
    // Create an eval suite
    val evalSuite = EvalSuite(
        name = "ProductionEvalSuite",
        evals = listOf(
            FactualAccuracyEval(),
            RelevanceEval(),
            CoherenceEval()
        )
    )
    
    // Load test cases from a file
    val testCases = File("test_cases.txt").readLines()
    
    // Run the eval suite
    val results = evalSuite.evaluate(myAgent, testCases)
    
    // Check if the results meet the threshold
    val threshold = 0.8
    val passed = results.averageScore >= threshold
    
    if (passed) {
        println("Eval suite passed with score: ${results.averageScore}")
        System.exit(0) // Success
    } else {
        println("Eval suite failed with score: ${results.averageScore} (threshold: $threshold)")
        System.exit(1) // Failure
    }
}
```

## Best Practices ✅

1. **Use Multiple Evals**: Different evals measure different aspects of agent quality
2. **Create Diverse Test Cases**: Include edge cases and challenging scenarios
3. **Set Realistic Thresholds**: Start with lower thresholds and gradually increase them
4. **Monitor Trends**: Track eval scores over time to identify regressions
5. **Combine with Human Evaluation**: Use evals alongside human evaluation for a complete picture

## Next Steps ✅

Now that you understand evals, you can:

1. [Create custom evals](./custom-evals.mdx)
2. [Set up eval suites](./eval-suites.mdx)
3. [Integrate evals into CI/CD](./ci-cd-integration.mdx)
